Abstract We present a video summarization approach for
egocentric or “wearable” camera data. Given hours of video,
the proposed method produces a compact storyboard summary
of the camera wearer’s day. In contrast to traditional
keyframe selection techniques, the resulting summary focuses
on the most important objects and people with which
the camera wearer interacts. To accomplish this, we develop
region cues indicative of high-level saliency in egocentric
video (such as the nearness to hands, gaze, and frequency
of occurrence) and learn a regressor to predict the relative
importance of any new region based on these cues. Using
these predictions and a simple form of temporal event detection,
our method selects frames for the storyboard that
reflect the key object-driven happenings. We adjust the
compactness of the final summary given either an importance
selection criterion or a length budget; for the latter, we design
an efficient dynamic programming solution that accounts for
importance, visual uniqueness, and temporal displacement.
Critically, the approach is neither camera-wearer-specific
nor object-specific; that means the learned importance metric
need not be trained for a given user or context, and it
can predict the importance of objects and people that have
never been seen previously. Our results on two egocentric
video datasets show the method’s promise relative to existing
techniques for saliency and summarization.
1 Introduction

The goal of video summarization is to produce a compact
visual summary that encapsulates the key components of a
video. Its main value is in turning hours of video into a short
summary that can be interpreted by a human viewer in a
matter of seconds. Automatic video summarization methods
would be useful for a number of practical applications,
such as analyzing surveillance data, video browsing, action
recognition, or creating a visual diary of one’s personal
lifelog video.

Existing methods extract keyframes [?], create
montages of still images [5, 6], or generate compact dynamic
summaries [?]. Despite promising results, they assume a
static background or rely on low-level appearance and motion
cues to select what will go into the final summary. However,
in many interesting settings, such as egocentric videos,
YouTube style videos, or feature films, the background is
moving and changing. More critically, a system that lacks
high-level information on which objects matter may produce
a summary that consists of irrelevant frames or regions. In
other words, existing methods are indifferent to the impact
that each object has on generating the “story” of the video.

In this work, we are interested in creating object-driven
summaries for videos captured from a wearable camera. An
egocentric video offers a first-person view of the world that
cannot be captured from environmental cameras. For example,
we can often see the camera wearer’s hands, or find the
object of interest centered in the frame. Essentially, a wearable
camera focuses on the user’s activities, social interactions,
and interests. We aim to exploit these properties for
egocentric video summarization.

Good summaries for egocentric data would have wide
potential uses. Not only would recreational users (including
“life-loggers”) find them useful as a video diary, but there
are also high-impact applications in law enforcement, elder
and child care, and mental health. For example, the summaries
could help police officers review important
evidence, suspects, and witnesses, or aid patients with
memory problems to remember specific events, objects, and
people [?]. Furthermore, the egocentric view translates
naturally to robotics applications, suggesting, for example,
that a robot could summarize what it encounters while
navigating unexplored territory, for later human viewing.

Fig. 1 Given an unannotated egocentric video, our method produces
a compact storyboard visual summary that focuses on the key people
and objects. (Input: egocentric video of the camera wearer's day,
1:00 pm to 6:00 pm. Output: storyboard summary of important people
and objects.)

Motivated by these problems, we propose an approach
that learns category-independent importance cues designed
explicitly to target the key objects and people in the video.
The main idea is to leverage novel egocentric and high-level
saliency features to train a model that can predict important
regions in the video, and then to produce a concise visual
summary that is driven by those regions (see Fig. 1). By
learning to predict important regions, we can focus the visual
summary on the main people and objects, and ignore
irrelevant or redundant information.

Our method works as follows. We first train a regression
model from labeled training videos that scores any region’s
likelihood of belonging to an important person or object. For
the input variables, we develop a set of high-level cues to
capture egocentric importance, such as frequency, proximity
to the camera wearer’s hand, and object-like appearance
and motion, as well as a set of low-level cues to capture
region properties such as size, width, and height. The target
variable is the overlap with ground-truth important regions,
i.e., the importance score. Given a novel video, we use the
model to predict important regions for each frame. We then
partition the video into unique temporal events, by clustering
scenes that have similar color distributions and are close
in time. For each event, we isolate unique representative
instances of each important person or object. Finally, we
produce a storyboard visual summary that displays the most
important objects and people across all events in the camera
wearer’s day.

We propose two ways to adjust the compactness of the
summary, based on either a target importance criterion or a
target summary length. For the latter, we design an energy
function that accounts for the importance of the selected
frames, their visual dissimilarities, and their temporal
displacements, and can be efficiently optimized using dynamic
programming.

We emphasize that we do not aim to predict importance
for any specific category (e.g., cars). Instead, we learn a
general model that can predict the importance of any object
instance, irrespective of its category. This
category-independence avoids the need to train importance predictors
specific to a given camera wearer, and allows the system to
recognize as important something it has never seen before.
In addition, it means that objects from the same category can
be predicted to be (un)important depending on their role in
the story of the video. For example, if the camera wearer
has lunch with his friend Jill, she would be considered
important, whereas people in the same restaurant sitting around
them could be unimportant. Then, if they later attend a party
but chat with different friends, Jill may no longer be
considered important in that context.

Our main contribution is an egocentric video summarization
approach that is driven by predicted important people
and objects. Towards this goal, we develop two primary
technical ideas. In the first, we develop a learning approach
to estimate region importance using novel cues designed
specifically for the egocentric video setting. In the second,
we devise an efficient keyframe selection strategy that captures
the most important objects and people, subject to meeting
a budget for the desired length of the output storyboard.

We apply our method to challenging real-world videos
captured by users in uncontrolled environments, and process
a total of 27 hours of video, significantly more data than
previous work in egocentric analysis. Evaluating the predicted
importance estimates and summaries, we find our approach
outperforms state-of-the-art high-level and low-level
saliency measures for this task, and produces significantly
more informative summaries than traditional methods.

This article expands upon our previous conference paper [?]
in terms of the method design, experiments, and
presentation. In Sections 3.6.2 and 4.5 we introduce and
analyze a novel budgeted frame selection approach that
efficiently produces fixed-length summaries. In Section 4, we
add new comparisons to multiple existing video summarization
methods, analyze object prominence in the summaries,
conduct new user studies with over 25 users to systematically
gauge the summaries’ quality, and produce new results
on the Activities of Daily Living dataset [?]. Finally,
throughout we provide more detailed algorithm explanations
(including Figures 2, 3, and 5).


2 Related Work 

Video summarization: Static keyframe methods compute
motion stability from optical flow [?] or global scene
color/texture differences [?] to select the frames that
go into the summary. The low-level approach means that
irrelevant frames can often be selected, which is particularly
problematic for our application of summarizing hours
of continuous egocentric video that contain lots of irrelevant
data. By generating object-driven summaries, we aim
to move beyond such low-level cues.

Video summarization can also take the form of a single
montage of still images. Existing methods take a
background reference frame and project in foreground
regions [?], or sequentially display automatically selected
key-poses [?]. An interactive approach [?] takes user-selected
frames and key points, and generates a storyboard that conveys
the trajectory of an object. These approaches generally
assume short clips with few objects, or a human-in-the-loop
to guide the summarization process. In contrast, we aim to
summarize a camera wearer’s day containing hours of continuous
video with hundreds of objects, with no human
intervention.


Compact dynamic summaries simultaneously show several
spatially non-overlapping actions from different times
of the video [?]. While that framework aims to focus
on foreground objects, it assumes a static camera and is
therefore inapplicable to egocentric video. A re-targeting
approach aims to simultaneously preserve an original video’s
content while reducing artifacts [14], but unlike our approach,
does not attempt to characterize the varying degrees
of object importance. In a semi-automatic method [?],
irrelevant video frames are removed by detecting the main object
of interest given a few user-annotated training frames.
In contrast, our approach automatically discovers multiple
important objects.

Saliency detection: Early saliency detectors rely on
bottom-up image cues (e.g., [16, 17]). More recent work
tries to learn high-level saliency measures using various
Gestalt cues, whether for static images [18, 19, 20, 21] or
video [22]. Whereas typically such metrics aim to prime
a visual search process, we are interested in high-level
saliency for the sake of isolating those things worth summarizing.
Researchers have also explored ranking object importance
in static images, learning what people mention first
from human-annotated tags [23, 24]. In contrast, we learn
the importance of objects in terms of their role in a long-term
video’s story. Relative to any of the above, we introduce
novel saliency features amenable to the egocentric video
setting.

Egocentric visual data analysis: Vision researchers
have recently returned to exploring egocentric visual analysis,
prompted in part by increasingly portable wearable
cameras. Early work with wearable cameras partitions visual
and audio data into events [25], or uses supervised learning
for specialized tasks like sign language recognition [26]
or location recognition within a building [27]. Methods in
ubiquitous computing use manual intervention [28] or external
non-visual sensors [29, 30] (e.g., skin conductivity or
audio) to trigger snapshots from a wearable camera. Others
use brain waves [?], k-means clustering with temporal
constraints [32], or face detection [33] to segment egocentric
videos. Recent methods explore activity recognition
[34, 35, 12, 36], handled object recognition [37], novelty
detection [38], hand detection [39], gaze prediction [40],
social interaction analysis [?], or activity discovery for
non-visual sensory data [42]. Unsupervised algorithms are
developed to discover scenes [?] and actions [?], or select
keyframes [45], based on low-level visual features extracted
from egocentric data. In contrast to all these methods, we
aim to build a visual summary, and model high-level importance
of the objects present.

To our knowledge, we are the first to explore visual
summarization of egocentric video by predicting important objects.
Recent work [46] builds on our approach and uses
our importance predictions as a cue to generate story-driven
egocentric video summarizations.


3 Approach 

Our goal is to create a storyboard summary of a person’s
day that is driven by the important people and objects. The
video is captured using a wearable camera that continuously
records what the user sees. We define importance in the
scope of egocentric video: important things are those with
which the camera wearer has significant interaction. This
is reasonable for the egocentric setting, since the camera
wearer is likely to engage in social activities with cliques
of people (e.g., friends, co-workers) that involve interactions
with specific objects (e.g., food, computer). The camera
wearer will typically find these people and objects to be
memorable, as we confirm in our user studies in Section 4.6.

There are four main steps to our approach: (1) using
novel egocentric saliency cues to train a category-independent
regression model that predicts how likely it is
that an image region belongs to an important person or
object; (2) partitioning the video into temporal events;
(3) for each event, scoring each region’s importance using
the regressor; and (4) selecting representative key-frames
for the storyboard that encapsulate the predicted important
people and objects, either using a user-specified importance
criterion or a length budget.

We first describe how we collect the video data and 
ground-truth annotations needed to train our model. We then 
describe each of the main steps in turn. 
(a) Text description interface:
"Name the important objects/people in the video.
1. Watch the video. The video is 3 minutes long.
2. Describe every visible important object/person in the video. There
will be 1 to 5 important objects/people. These are key items/players that
are essential to the "story"; i.e., things that would be necessary to create
a summary of the video. For example, objects/people that frequently
appear, objects/people that the camera wearer interacts with, are some
things that could be considered important."

(b) Annotation interface:
"Draw the boundaries of the described objects as accurately as possible.
1. Read the text descriptions of the objects to find. These are the
objects you are looking for in the images!
2. Find those objects in the image below. If the image does not contain
any of the described objects, check the button "objects absent".
Otherwise, click "start drawing" to draw a tight-fitting boundary of one
described object in the image. If the image contains multiple described
objects, select the most central, prominent one to annotate."
Objects to find: (1) black pot with rice; (2) white rice cooker;
(3) television; (4) man in glasses.

(c) Example annotations: "Man wearing a blue shirt in cafe"; "Yellow
notepad on table"; "Coffee mug that cameraman drinks"; "Iphone that the
cameraman holds"; "Camera wearer cleaning the plates".

Fig. 2 Our Mechanical Turk interfaces for important person/object (a) text description and (b) annotation, and (c) example annotations that we
obtained. The important people and objects are annotated.


3.1 Egocentric video data collection 

We use the Looxcie wearable camera, which captures video 
at 15 fps at 320 x 480 resolution. It is worn around the ear 
and looks out at the world at roughly eye-level. We collected 
10 videos from four subjects, each three to five hours in 
length (the maximum battery life), for a total of 37 hours 
of video. We call this the UT Egocentric (UT Ego) dataset. 
Our data is publicly available.¹

Four subjects wore the camera for us: one undergraduate
student, two grad students, and one office worker, ranging
in age from early to late 20s and including both genders. The
different backgrounds of the subjects ensure diversity in the
data (not everyone’s day is the same) and are critical for
validating the category-independence of our approach. We asked the
subjects to record their natural daily activities, and explicitly
instructed them not to stage anything for this purpose. The
videos capture a variety of activities such as eating, shopping,
attending a lecture, driving, cooking, and working on
a computer.

3.2 Annotating important regions in video 

To train the importance predictor, we first need ground-truth
training examples. In general, determining whether an object
is important or not can be highly subjective. Fortunately,
an egocentric video provides many constraints that are suggestive
of an object’s importance. For example, one can observe
the camera wearer’s hands, and an object of interest
may often be centered in the frame.

¹ http://vision.cs.utexas.edu/projects/egocentric/
Due to privacy issues, we are only able to share 4 of the 10 videos (one from each
subject), for a total of 17 hours of video. They correspond to the test videos that we
evaluate on in Sec. 4.

In order to learn meaningful egocentric properties without
overfitting to any particular category, we crowd-source
annotations using Amazon’s Mechanical Turk (MTurk). For
egocentric videos, an object’s degree of importance will depend
on what the camera wearer is doing before, while, and
after the object or person appears. In other words, the object
must be seen in the context of the camera wearer’s activity
to properly gauge its importance.

We carefully design two annotation tasks to capture this
aspect. In the first task, we ask workers to watch a
three-minute accelerated video (equivalent to 10 minutes of original
video) and to describe in text what they perceive to be
essential people or objects necessary to create a summary of
the video. In the second task, we display uniformly sampled
frames from the video and their corresponding text descriptions
obtained from the first task, and ask workers to draw
polygons around any described person or object. If none of
the described objects are present in a frame, the annotator is
given the option to skip it. See Fig. 2 for the two interfaces
and example annotations.

We found this two-step process more effective than a single
task in which the same worker both watches the video
and then annotates the regions s/he deems important, likely
due to the time required to complete both tasks. Critically,
the two-step process also helps us avoid bias: a single annotator
asked to complete both tasks at once may be biased
to pick easier things to annotate rather than those s/he finds
to be most important. Our setup makes it easy for the first
worker to freely describe the objects without bias, since s/he
only has to enter text. We found the resulting annotations
quite consistent, and only manually pruned those where the
region outlined did not agree with the first worker’s description.
For a 3-5 hour training video, we obtain roughly 35 text
descriptions and 700 object segmentations.

Fig. 3 Illustration of our (a) egocentric features and (b) object features.
(Panels: (a) distance to hand, distance to frame center, frequency;
(b) object-like appearance, object-like motion, overlap w/face detection.)


3.3 Learning egocentric region importance 

We now discuss the procedure to train a general purpose
category-independent model that will predict important regions
in any egocentric video, independent of the camera
wearer. Given a video, we first generate candidate regions
for each frame using a min-cut method [20], which tends to
avoid oversegmenting objects. We represent objects at the
frame-level, since our uncontrolled setting usually prohibits
reliable space-time object segmentation due to frequent and
rapid head movements by the camera wearer. We generate
roughly 800 regions per frame.

For each region, we compute a set of candidate features
that could be useful to describe its importance. Since the
video is captured by an active participant, we specifically
want to exploit egocentric properties such as whether the
object/person is interacting with the camera wearer, whether it
is the focus of the wearer’s gaze, and whether it frequently
appears. In addition, we aim to capture high-level saliency
cues (such as an object’s motion and appearance, or the
likelihood of being a human face) and generic region properties
shared across categories, such as size or location. We
describe the proposed features in detail next.


3.3.1 Feature definitions 

Egocentric features: Fig. 3(a) illustrates the three proposed
egocentric features. To model interaction, we compute the
Euclidean distance of the region’s centroid to the closest
detected hand in the frame. Given a frame in the test video, we
first classify each pixel as (non-)skin using color likelihoods
and a Naive Bayes classifier [?] trained with ground-truth
hand annotations on disjoint data. We then classify any
superpixel (computed using [48]) as hand if more than 25% of
its pixels are skin. While simple, we find this hand detector
is sufficient for our application. More sophisticated methods
(e.g., [49]) would certainly be possible as well.
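
The superpixel rule above can be sketched as follows. This is only an illustration under stated assumptions: the per-pixel skin mask and superpixel label map are taken as given, the function names are hypothetical, and summarizing each hand superpixel by its centroid (and returning None when no hand is detected) are choices made here, not details taken from the text.

```python
import math


def hand_superpixels(superpixel_labels, skin_mask, min_skin_fraction=0.25):
    """Return labels of superpixels in which more than `min_skin_fraction`
    of the pixels were classified as skin (the 25% rule in the text)."""
    total, skin = {}, {}
    for label_row, skin_row in zip(superpixel_labels, skin_mask):
        for label, is_skin in zip(label_row, skin_row):
            total[label] = total.get(label, 0) + 1
            skin[label] = skin.get(label, 0) + (1 if is_skin else 0)
    return {lab for lab in total if skin[lab] / total[lab] > min_skin_fraction}


def distance_to_closest_hand(region_centroid, hand_centroids):
    """Euclidean distance from a region centroid to the nearest hand
    centroid. Returning None when no hand is found is an assumption."""
    if not hand_centroids:
        return None
    return min(math.dist(region_centroid, h) for h in hand_centroids)


# Toy 2x4 frame with two superpixels (labels 0 and 1).
labels = [[0, 0, 1, 1],
          [0, 0, 1, 1]]
skin = [[1, 1, 0, 0],
        [0, 0, 0, 1]]
hands = hand_superpixels(labels, skin)
```

In this toy frame, superpixel 0 has half of its pixels marked as skin and is kept, while superpixel 1 sits exactly at 25% and is rejected because the rule requires strictly more than 25%.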

To model gaze, we compute the Euclidean distance of 
the region’s centroid to the frame center. Since the camera 
moves with the wearer’s head, this is a coarse estimate of 
how likely the region is being focused upon. 

To model frequency, we record the number of times an
object instance is detected within a short temporal segment
of the video. We create two frequency features: one based
on matching regions, the other based on matching points.
For the first, we compute the color dissimilarity between a
region $r$ and each region $r_n$ in its surrounding frames, and
accumulate the total number of positive matches:

$$c_{\text{region}}(r) = \sum_{f \in W} \Big[ \big( \min_n \chi^2(r, r_n^f) \big) \le \theta_r \Big], \qquad (1)$$

where $f$ indexes the set of frames $W$ surrounding region
$r$’s frame, $\chi^2(r, r_n)$ is the $\chi^2$-distance between color
histograms of $r$ and $r_n$, $\theta_r$ is the distance threshold to determine
a positive match, and $[\cdot]$ denotes the indicator function. The
value of $c_{\text{region}}$ will be high/low when $r$ produces many/few
matches (i.e., is frequent/infrequent).
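
A minimal sketch of the region-based frequency count in Eq. (1), assuming each region is already summarized by a normalized color histogram. The function names, the symmetric form of the chi-square distance with a 1/2 factor, and the handling of frames without regions are assumptions made for illustration.

```python
def chi2_distance(h1, h2):
    """Symmetric chi-square distance between two histograms
    (one common definition; the exact normalization is an assumption)."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


def region_frequency(region_hist, window_frames, theta_r):
    """Count the frames in the window W whose best-matching region lies
    within chi-square distance theta_r of the query region.

    window_frames: one list of region histograms per surrounding frame.
    Frames with no regions contribute no match (an assumption)."""
    count = 0
    for frame_regions in window_frames:
        if not frame_regions:
            continue
        best = min(chi2_distance(region_hist, h) for h in frame_regions)
        if best <= theta_r:
            count += 1
    return count


query = [0.5, 0.5]
window = [
    [[0.5, 0.5], [1.0, 0.0]],  # contains an exact match
    [[1.0, 0.0]],              # best distance is about 0.33
    [],                        # no regions in this frame
]
c_region = region_frequency(query, window, theta_r=0.1)
```

Only the first frame contains a region within the threshold, so the count for this toy window is 1.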

The second frequency feature is computed by matching
Difference of Gaussian SIFT interest points. For a detected
point $p$ in region $r$, we match it to all detected points in each
frame $f \in W$, and count as positive those that pass the ratio
test [50]. We repeat this process for each point in region $r$,
and record their average number of positive matches:

$$c_{\text{point}}(r) = \frac{1}{P} \sum_{i=1}^{P} \sum_{f \in W} \left[ \frac{d(p_i, p^f_{1*})}{d(p_i, p^f_{2*})} \le \theta_p \right], \qquad (2)$$

where $i$ indexes all $P$ detected points in region $r$, $d(p_i, p^f_{1*})$
and $d(p_i, p^f_{2*})$ measure the Euclidean distance between $p_i$
and its best matching point $p^f_{1*}$ and second best matching
point $p^f_{2*}$ in frame $f$, respectively, and $\theta_p$ is Lowe’s ratio
test threshold for non-ambiguous matches [50]. The value of
$c_{\text{point}}$ will be high/low when the SIFT points in $r$ produce
many/few matches. For both frequency features, we set $W$
to span a 10 minute temporal window.
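
The point-based count in Eq. (2) can be sketched in the same spirit. Here descriptors are plain coordinate tuples rather than real SIFT vectors, the function name is hypothetical, and skipping frames with fewer than two points (where the ratio is undefined) is an assumption.

```python
import math


def point_frequency(region_points, window_frames, theta_p):
    """Average, over the points in a region, of the number of frames in
    which the point passes the ratio test against that frame's points.

    region_points: descriptors of the points detected in region r.
    window_frames: one list of point descriptors per frame in W."""
    if not region_points:
        return 0.0
    positives = 0
    for p in region_points:
        for frame_points in window_frames:
            if len(frame_points) < 2:
                continue  # ratio test needs a best and a second-best match
            dists = sorted(math.dist(p, q) for q in frame_points)
            best, second = dists[0], dists[1]
            if second > 0 and best / second <= theta_p:
                positives += 1
    return positives / len(region_points)


points_in_r = [(0.0, 0.0), (10.0, 10.0)]
window = [
    [(0.0, 1.0), (5.0, 5.0)],  # both query points match unambiguously
    [(3.0, 3.0), (3.0, 4.0)],  # best and second-best are too similar
]
c_point = point_frequency(points_in_r, window, theta_p=0.6)
```

Each query point passes the ratio test in the first frame only, so the average number of positive matches is 1.0.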

Object features: In addition to the egocentric-specific
features, we include three high-level (i.e., object-based)
saliency cues (see Fig. 3(b)). To model object-like appearance,
we use the learned region ranking function of [20].
It reflects Gestalt cues indicative of any object, such as the
sum of affinities along the region’s boundary, its perimeter,
and texture difference with nearby pixels. (Note that the authors
trained their measure on PASCAL data, which is disjoint
from ours.) We stress that while this feature estimates
how “object-like” a region is, it does not gauge importance.
It is useful for identifying full object segments, as opposed
to fragments.

To model object-like motion, we develop a key-segments
video segmentation descriptor [22]. It looks at
the difference in motion patterns of a region relative to its
closest surrounding regions. Specifically, we compare optical
flow histograms for the region and the pixels around it
within a loosely fit bounding box. Note that this feature is
not simply looking for large motions or appearance changes
from background. Rather, we are describing how the motion
of the region differs from its closest surrounding regions;
this allows us to forgo assumptions about camera motion,
and also to be sensitive to different magnitudes of motion.
Similar to the appearance feature above, it is useful for selecting
object-like regions that “stand out” from their surroundings.

To model the likelihood of a person’s face, we compute
the maximum overlap score between the region $r$ and
any detected frontal face $q$ in the frame, using [?].

Region features: Finally, we compute the region’s
size, centroid, bounding box centroid, bounding box
width, and bounding box height. They reflect
category-independent importance cues and are blind to the region’s
appearance or motion. We expect that important people and
objects will occur at non-random scales and locations in the
frame, due to social and environmental factors that constrain
their relative positioning to the camera wearer (e.g., sitting
across a table from someone when having lunch, or handling
cooking utensils at arm’s length). Our region features
capture these statistics.

Altogether, these cues form a 14-dimensional feature
space to describe each candidate region (4 egocentric, 3 object,
and 7 region feature dimensions).


3.3.2 Regressor to predict region importance 

Using the features defined above, we next train a model that
can predict a region’s importance. The model should be able
to learn and predict a region’s degree of importance instead
of whether it is simply “important” or “not important”, so
that we can meaningfully adjust the compactness of the final
summary (as we demonstrate in Section 4). Thus, we opt to
train a regressor rather than a classifier.

While the features defined above can be individually
meaningful, we also expect significant interactions between
the features. For example, a region that is near the camera
wearer’s hand might be important only if it is also object-like
in appearance. Therefore, we train a linear regression
model with pair-wise interaction terms to predict a region
$r$’s importance score:

$$I(r) = \beta_0 + \sum_{i=1}^{N} \beta_i x_i(r) + \sum_{i=1}^{N} \sum_{j=i+1}^{N} \beta_{i,j} x_i(r) x_j(r), \qquad (3)$$

where the $\beta$’s are the learned parameters, $x_i(r)$ is the $i$th
feature value, and $N = 14$ is the total number of features.
For training, we define a region $r$’s target importance
score by its maximum overlap $\frac{|GT \cap r|}{|GT \cup r|}$ with any ground-truth
region $GT$ in a training video obtained from Section 3.2.
Thus, regions with perfect overlap with ground-truth will
have a target importance score of 1, those with no overlap
with ground-truth will have an importance score of 0, and all
others will have an importance score in (0,1). We standardize
the features to zero-mean and unit-variance, and solve
for the $\beta$’s using least-squares. For testing, our model takes
as input a region $r$’s features (the $x_i$’s) and predicts its
importance score $I(r)$. Note that we train and test using video
from different users to avoid overfitting our model to any
specific camera wearer.
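
To make the training setup concrete, the sketch below computes an overlap-based target score and fits a linear model with pairwise interaction terms by least squares. It is an illustration only: regions are represented as sets of pixel coordinates, the normal equations are solved with a small Gaussian-elimination routine, feature standardization is omitted, and all function names are hypothetical.

```python
from itertools import combinations


def overlap_score(region, ground_truth):
    """Intersection-over-union of two pixel sets."""
    union = region | ground_truth
    return len(region & ground_truth) / len(union) if union else 0.0


def target_importance(region, ground_truth_regions):
    """Maximum overlap with any ground-truth region (0 if there are none)."""
    return max((overlap_score(region, gt) for gt in ground_truth_regions),
               default=0.0)


def expand(x):
    """Map features to [1, x_1..x_N, x_i * x_j for i < j], as in Eq. (3)."""
    pairs = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return [1.0] + list(x) + pairs


def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def fit_least_squares(features, targets):
    """Least-squares fit of the interaction model via the normal equations."""
    Z = [expand(x) for x in features]
    k = len(Z[0])
    A = [[sum(z[i] * z[j] for z in Z) for j in range(k)] for i in range(k)]
    b = [sum(z[i] * t for z, t in zip(Z, targets)) for i in range(k)]
    return solve_linear(A, b)


def predict_importance(beta, x):
    return sum(b * z for b, z in zip(beta, expand(x)))


# Toy data with N = 2 features generated from known coefficients.
true_beta = [0.1, 0.2, 0.3, 0.4]
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [predict_importance(true_beta, x) for x in X]
beta = fit_least_squares(X, y)
```

With $N = 14$ features the expanded model has 1 + 14 + 91 = 106 coefficients; the two-feature toy problem above has only four, which keeps the normal equations small enough to solve by hand-rolled elimination.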


3.4 Segmenting the video into temporal events 

Given a new video, we first partition the video temporally
into events, and then isolate the important people and objects
in each event. Events allow the final summary to include
multiple instances of an object/person that is central
in multiple contexts in the video. For example, suppose that
the camera wearer plays with her dog at home in the morning
and later takes the dog out to the park at night. We can
treat the two instances of the dog as different objects (since
they appear in different events) and include both in the final
summary. Moreover, events indicate which selected frames
are more related to one another, giving a hierarchical structure
to the final summary.

While shot boundary detection has been frequently used
to perform event segmentation for videos, it is impractical
for our wearable camera data setting. Traditional shot detection
generally assumes visual continuity and thus tends to
oversegment egocentric events due to frequent head movements.
Instead, we detect egocentric events by clustering
scenes in such a way that frames with similar global appearance
can be grouped together even when there are a few
unrelated frames (“gaps”) between them.

Let $V$ denote the set of all video frames. We compute a
pairwise distance matrix $D_V$ between all frames $f_m, f_n \in V$,
using the distance:

$$D(f_m, f_n) = 1 - w^t_{m,n} \exp\left(-\frac{1}{\Omega} \chi^2(f_m, f_n)\right), \qquad (4)$$

where $w^t_{m,n} = \frac{1}{t} \max(0, t - |m - n|)$, $t$ is the size of the
temporal window surrounding frame $f_m$, $\chi^2(f_m, f_n)$ is the
$\chi^2$-distance between color histograms of $f_m$ and $f_n$, and
$\Omega$ denotes the mean of the $\chi^2$-distances among all frames.
Thus, frames similar in color receive a low distance, subject
to a weight that discourages frames too distant in time from
being grouped.

Fig. 4 Distance matrix that measures global color dissimilarity between
all frames. (Blue/red reflects high/low distance.) The images
show representative frames of each discovered event. The block structure
along the diagonal reveals groups of frames that are close in
appearance and time.
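
A small sketch of the frame distance in Eq. (4), assuming each frame is summarized by a color histogram. The function names, the specific chi-square normalization, and taking the mean over distinct frame pairs are assumptions made for illustration.

```python
import math


def chi2_distance(h1, h2):
    """Symmetric chi-square distance (normalization is an assumption)."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


def event_distance_matrix(frame_hists, t):
    """D[m][k] = 1 - w * exp(-chi2 / omega), with the temporal weight
    w = max(0, t - |m - k|) / t and omega the mean chi-square distance
    over distinct frame pairs."""
    n = len(frame_hists)
    chi = [[chi2_distance(frame_hists[m], frame_hists[k]) for k in range(n)]
           for m in range(n)]
    pair_values = [chi[m][k] for m in range(n) for k in range(m + 1, n)]
    omega = sum(pair_values) / len(pair_values) if pair_values else 0.0
    D = [[0.0] * n for _ in range(n)]
    for m in range(n):
        for k in range(n):
            weight = max(0, t - abs(m - k)) / t
            # If every frame is identical, treat color similarity as 1.
            similarity = math.exp(-chi[m][k] / omega) if omega > 0 else 1.0
            D[m][k] = 1.0 - weight * similarity
    return D


# Three frames: the first two share a histogram, the third differs.
hists = [[0.5, 0.5], [0.5, 0.5], [1.0, 0.0]]
D = event_distance_matrix(hists, t=2)
```

In this toy case, the two identical neighboring frames get distance 0.5 purely from the temporal weight, and frames two steps apart fall outside the window and get the maximum distance of 1.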

We next perform complete-link agglomerative clustering
with $D_V$, grouping frames until the smallest maximum
inter-frame distance is larger than two standard deviations beyond
$\Omega$. The first and last frames in a cluster determine the start
and end frames of an event, respectively. Fig. 4 shows the
distance matrix computed for one subject’s day, and the
representative frames for each discovered event.
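
The clustering step can be illustrated with a plain complete-link procedure over a precomputed distance matrix. This is a naive sketch for small inputs, not an efficient implementation; the function names are hypothetical and the stopping threshold is passed in directly rather than derived from the distance statistics.

```python
def complete_link_clusters(D, threshold):
    """Greedy complete-link agglomerative clustering.

    Repeatedly merge the two clusters whose maximum inter-frame distance
    is smallest, and stop once that smallest value exceeds `threshold`."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = max(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or link < best[0]:
                    best = (link, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return sorted(clusters)


def event_boundaries(clusters):
    """First and last frame index of each cluster give the event extent."""
    return [(min(c), max(c)) for c in clusters]


# Four frames forming two visually coherent groups.
D = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
events = event_boundaries(complete_link_clusters(D, threshold=0.5))
```

Because complete-link merging uses the largest pairwise distance between two groups, a few dissimilar frames between similar ones do not prevent those similar frames from being grouped, as long as the merged cluster stays below the threshold.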


3.5 Discovering an event’s key people/objects 


For each event, we aim to select the important people and
objects that will go into the final summary, while avoiding
redundancy. Recall that objects are represented at the
frame-level (Section 3.3). Thus, our goal is to group together
instances of the same person or object that appear over time in
each event.

Given an event, we first score each bottom-up segment
in each frame using our regressor. Since we do not know a
priori how many important things an event contains, we generate
a candidate pool of clusters from the set $C$ of bottom-up
regions, and then remove any redundant clusters, as follows.

To extract the candidate groups, we first compute an
affinity matrix $K_C$ over all pairs of regions $r_m, r_n \in C$,
where affinity is determined by color similarity:
$K_C(r_m, r_n) = \exp(-\frac{1}{\Gamma}\chi^2(r_m, r_n))$, where $\Gamma$ denotes the
mean $\chi^2$-distance among all pairs in $C$. We next partition
$K_C$ into multiple (possibly overlapping) inlier/outlier clusters
using a factorization approach [52]. The method finds
tight sub-graphs within the input affinity graph while resisting
the influence of outliers. Each resulting sub-graph consists
of a candidate important object’s instances. To reduce
redundancy, we sort the sub-graph clusters by the average
$I(r)$ of their member regions, and remove those with high
affinity to a higher-ranked cluster. Finally, for each remaining
cluster, we select the region with the highest importance
score as its representative (see Fig. 5).

Fig. 5 Discovering an event’s key people and objects. For each event,
we group together regions that are likely to belong to the same object,
and then for each group, we select the region with the highest importance
score as its representative.
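The redundancy-removal and representative-selection steps of this section can be sketched as follows. The candidate clusters are taken as given (the factorization step of [52] is not reproduced here), and measuring the affinity between two clusters as the mean pairwise region affinity, as well as the `max_affinity` cutoff, are assumptions made for illustration.

```python
def mean_importance(cluster, importance):
    return sum(importance[r] for r in cluster) / len(cluster)

def cluster_affinity(c1, c2, K):
    """Mean pairwise affinity between the regions of two clusters (assumed measure)."""
    return sum(K[a][b] for a in c1 for b in c2) / (len(c1) * len(c2))

def prune_and_pick(clusters, importance, K, max_affinity):
    """Rank clusters by average I(r), drop those too similar to a higher-ranked
    cluster, then return the most important region of each survivor."""
    ranked = sorted(clusters, key=lambda c: mean_importance(c, importance), reverse=True)
    kept = []
    for c in ranked:
        if all(cluster_affinity(c, k, K) <= max_affinity for k in kept):
            kept.append(c)
    return [max(c, key=lambda r: importance[r]) for c in kept]

# Toy data: regions 0-2 look alike, regions 3-4 look alike.
toy_importance = {0: 0.9, 1: 0.8, 2: 0.4, 3: 0.5, 4: 0.6}
toy_K = [[1.0 if a == b else (0.9 if (a < 3) == (b < 3) else 0.1) for b in range(5)]
         for a in range(5)]
toy_clusters = [[0, 1], [1, 2], [3, 4]]
toy_representatives = prune_and_pick(toy_clusters, toy_importance, toy_K, max_affinity=0.5)
```

In the toy data the overlapping cluster [1, 2] is discarded because it is nearly identical in appearance to the higher-ranked cluster [0, 1], so each remaining object is shown once.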


3.6 Generating a storyboard summary 

Finally, we create a storyboard visual summary of the video. 
We display the event boundaries and frames of the selected 
important people and objects (see Fig. ED- Each event can 
display a varying number of frames, depending on how 
many unique important things our method discovers. 

We propose two ways to adjust the compactness of the 
summary: (1) according to a target importance criterion, and 
(2) according to a target summary length. We describe each 
process in detail next. 


3.6.1 Summarization given an importance criterion 

We first describe how to summarize the video given an im¬ 
portance criterion. This allows the system to automatically 
produce the most compact summary possible that encapsu¬ 
lates only the people and objects that meet the importance 
threshold. 

When discovering an event’s key people and objects
(Section 3.5), we take only those regions that have importance
scores higher than the specified criterion to form set
$C$. We then proceed to group instances of the same person
or object together in $C$, and select the frame with the highest
scoring region in each group to go into the summary.


3.6.2 Summarization given a length budget 

Alternatively, we can summarize the video given a length
budget $k$. This allows the system to answer requests such as,
“Generate a 5-minute summary.” We formulate the objective
as a $k$-frame selection problem and define the following
energy function:

$$E(S) = -\sum_{i=1}^{|S|} I(f_{s_i}) + \sum_{i=1}^{|S|-1} \exp\left(-\frac{1}{\Omega}\chi^2(f_{s_i}, f_{s_{i+1}})\right) - \sum_{i=1}^{|S|-1} |s_i - s_{i+1}|^{\frac{1}{2}}, \quad (5)$$

where $S = \{s_1, \ldots, s_k\}$ is the set of indices of the $k$ selected
frames, and $\Omega$ is the mean of the $\chi^2$-distances among
all frames.

There are three terms in our energy function. The first
term enforces selection of important frames, since we want
the summary to contain the discovered important people and
objects. We score each frame using the region that has the
highest importance score: $I(f_{s_i}) = \max_m I(r_m^{s_i})$, where
$r_m^{s_i}$ is the $m$th region in frame $f_{s_i}$. Our second term enforces
visual uniqueness, i.e., that adjacent selected frames contain
different objects. We want the summary to avoid including
redundant frames. Thus, we compute an affinity based
on the $\chi^2$-distance between color histograms of adjacent
frames $f_{s_i}$ and $f_{s_{i+1}}$. Finally, our last term enforces selection
of frames that are spread out in time such that the summary
best captures the entire “story” of the original video.
For this, we compute the difference in frame index of the
selected frames. Note that $\sum_{i=1}^{|S|-1} |s_i - s_{i+1}|^{\frac{1}{2}}$ achieves a
maximum when the temporal distances between all adjacent
frames $|s_i - s_{i+1}|$ are equal.
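To make the three terms of Eq. 5 concrete, the sketch below evaluates the energy of a candidate summary. It assumes the temporal term uses a square root (exponent 1/2), which is the reading consistent with the remark that equal spacing maximizes the spread term; the function and argument names are hypothetical, and the per-term standardization mentioned in the implementation details is omitted.

```python
import math

def energy(S, importance, chi2, omega):
    """Energy of an ordered list of selected frame indices S (lower is better).

    importance: sequence giving I(f) for every frame index
    chi2:       function (m, n) -> chi-square distance between frames m and n
    omega:      mean chi-square distance among all frames
    """
    adjacent = list(zip(S, S[1:]))
    importance_term = sum(importance[s] for s in S)
    similarity_term = sum(math.exp(-chi2(a, b) / omega) for a, b in adjacent)
    spread_term = sum(math.sqrt(abs(a - b)) for a, b in adjacent)
    return -importance_term + similarity_term - spread_term

# With importance and appearance held constant, only the temporal term differs.
flat_importance = [0.0] * 11
same_color = lambda m, n: 0.0
evenly_spaced = energy([0, 5, 10], flat_importance, same_color, omega=1.0)
bunched_up = energy([0, 1, 10], flat_importance, same_color, omega=1.0)
```

Because the square root is concave, the evenly spaced selection [0, 5, 10] earns a larger spread reward (and hence a lower energy) than the bunched selection [0, 1, 10] covering the same span.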

We compute the optimal set $S^*$ of $k$ frames by finding
the set that minimizes Eqn. 5:

$$S^* = \arg\min_{S \subset V} E(S), \quad (6)$$

where $V$ is the set of frames of the selected important people
and objects from Section 3.3.

A naive approach for optimizing Eqn. 6 would take time
$O(\binom{F}{k})$ for $F = |V|$ total frames. Instead, we efficiently
find the optimal set $S^*$ using dynamic programming, by exploiting
the optimal substructure that exists in the $k$-frame
selection problem.

Specifically, the minimum energy $M(f_n, t)$ of a $t$-length
summary that selects frame $f_n$ at time step $t$ can be recursively
computed as follows:

$$M(f_n, t)_{t \le n \le F-k+t} = \begin{cases} -I(f_n), & \text{if } t = 1, \\ -I(f_n) + \min_{p \le m \le q} \left( e(f_m, f_n) + M(f_m, t-1) \right), & \text{if } 1 < t \le k, \end{cases} \quad (7)$$

where $p = t - 1$, $q = F - k + t + 1$, and $e(f_m, f_n) =
\exp(-\frac{1}{\Omega}\chi^2(f_m, f_n)) - |m - n|^{\frac{1}{2}}$. We enforce the selected
set of frames to be a temporally ordered subsequence of the
original video: $s_i < s_{i+1}, \forall i$. Thus, any “path” that does not
obey this rule is assigned infinite cost.


Input: Egocentric video, and importance selection criterion or
length budget k.

Output: Storyboard summary.

1. Train regression model. (Sec. 3.3)
2. Segment video into temporal events. (Sec. 3.4)
For each event,
   3. Compute I(r) for all regions. (Sec. 3.3)
   4. Group regions that belong to same person/object. (Sec. 3.5)
   5. Retain unique clusters, select most important region in each
      group. (Sec. 3.5)
6. Generate storyboard summary that shows selected important
   people/objects. (Sec. 3.6)

Algorithm 1: Our summarization approach


Using Eqn. 7, we can compute the minimum energy for a
$k$-length summary as $E(S^*) = \min_n M(f_n, k)$, which can
be solved in $O(F^2 k)$ time. We retrieve the optimal set of $k$
frames $S^*$ by backtracking from $f_n$ at time $k$.
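The dynamic program can be sketched in a few lines. The version below is an illustration under assumptions rather than the authors’ implementation: it uses zero-based frame indices, a precomputed $\chi^2$ matrix, a square-root temporal term, and loop bounds that are slightly looser than the index ranges in Eq. 7 (the ordering constraint $m < n$ is what guarantees correctness). A brute-force search over all $k$-subsets is included only to check the result on small inputs.

```python
import math
from itertools import combinations

def pair_cost(m, n, chi2, omega):
    """e(f_m, f_n): appearance-similarity penalty minus temporal-spread reward."""
    return math.exp(-chi2[m][n] / omega) - math.sqrt(abs(m - n))

def energy(S, importance, chi2, omega):
    """Energy of an ordered selection S, written as unary plus pairwise costs."""
    unary = -sum(importance[s] for s in S)
    return unary + sum(pair_cost(a, b, chi2, omega) for a, b in zip(S, S[1:]))

def select_k_frames(importance, chi2, omega, k):
    """Return (S, E(S)) minimising the energy over ordered k-subsets, in O(F^2 k)."""
    F = len(importance)
    if not 1 <= k <= F:
        raise ValueError("k must be between 1 and the number of frames")
    inf = float("inf")
    # M[t][n]: minimum energy of a (t + 1)-frame summary whose last frame is n.
    M = [[inf] * F for _ in range(k)]
    back = [[-1] * F for _ in range(k)]
    for n in range(F):
        M[0][n] = -importance[n]
    for t in range(1, k):
        for n in range(t, F):
            for m in range(t - 1, n):  # m < n keeps the summary in temporal order
                cand = -importance[n] + pair_cost(m, n, chi2, omega) + M[t - 1][m]
                if cand < M[t][n]:
                    M[t][n] = cand
                    back[t][n] = m
    last = min(range(F), key=lambda n: M[k - 1][n])
    best = M[k - 1][last]
    S = [last]
    for t in range(k - 1, 0, -1):  # backtrack through the stored predecessors
        S.append(back[t][S[-1]])
    S.reverse()
    return S, best

def brute_force(importance, chi2, omega, k):
    """Exhaustive minimum over all k-subsets, for checking small cases only."""
    return min(energy(list(c), importance, chi2, omega)
               for c in combinations(range(len(importance)), k))

toy_importance = [0.9, 0.1, 0.2, 0.8, 0.3, 0.7]
toy_chi2 = [[abs(a - b) for b in toy_importance] for a in toy_importance]
toy_S, toy_best = select_k_frames(toy_importance, toy_chi2, 1.0, 3)
```

The table has $F \times k$ entries and each is filled by scanning at most $F$ predecessors, which matches the $O(F^2 k)$ bound stated above.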

3.6.3 Discussion 

The two strategies presented above offer certain trade-offs. 
The importance criterion automatically produces the most 
compact summary possible that includes all unique in¬ 
stances of the important people and objects; however, it does 
not give the user direct control on the length of the output 
summary. In contrast, while the proposed budgeted formu¬ 
lation can return a storyboard of a specified length, it does 
not permit setting an absolute threshold on how important 
objects must be for inclusion. 

In addition to being a compact video diary of one’s day, 
our storyboard summary can be considered as a visual in¬ 
dex to help a user peruse specific parts of the video. This 
would be useful when one wants to relive a specific moment 
or search for less important people or objects that occurred 
with those found by our method. 

Alg. 1 recaps all the steps of our approach.


4 Results 


In this section we evaluate our approach on our new UT 
Egocentric (UT Ego) dataset and on the Activities of Daily 
Living (ADL) dataset G2. which consists of 17 and 10 
hours of egocentric video, respectively. We offer direct com¬ 
parisons to existing methods for both saliency and video 
summarization, and we perform a user study with over 25 
subjects to quantify the perceived quality of our results. We 
use UT Ego for the experiments in Sections [431-6, and ADL 


for the experiments in Section 4.7 


4.1 Dataset and implementation details 

For our UT Ego dataset, we collected 10 videos from four
subjects, each 3-5 hours long.^1 Each person contributed one
video, except one who contributed seven. The videos are
challenging due to frequent camera viewpoint/illumination
changes and motion blur. For evaluation, we use four data
splits: for each split we train with data from three users and
test on one video from the remaining user. Hence, the camera
wearers in any given training set are disjoint from those
in the test set, ensuring we do not learn user- or object-specific
cues.

Fig. 6 Precision-Recall for important object prediction. Numbers
in the legends denote average precision. By leveraging egocentric-specific
cues, our approach more accurately discovers the important
regions. (Legend entry recovered from the plot: Important (Ours): 0.26;
horizontal axis: Recall.)

Fig. 7 Example selected regions/frames. The first four columns show
examples of correct predictions made by our approach, and the last four
columns show failure cases in which the high-level saliency methods
[20, 21] make better predictions.

ADL contains 20 videos from chest-mounted cameras,
each on average about 30 minutes long. The camera wearers
perform daily activities in the house, like brushing hair,
cooking, washing dishes, or watching TV. To generate candidate
object regions on ADL, we use BING [53], which
generates bounding box proposals and is orders of magnitude
faster than the min-cut approach of [20].

We use Lab space color histograms, with 23 bins per
channel, and optical flow histograms with 61 bins per direction
using [54]. We set $t = 27000$ and $t = 2250$ (i.e., a
60 and 5 minute temporal window), for UT Ego and ADL,
respectively. We set $\theta_r = 10000$ and $\theta_p = 0.7$ after visually
examining a few examples. We fix all parameters for all
results. For efficiency, we process every 15th frame (i.e., 1
fps). For Eqn. 5, we standardize each term to zero-mean and
unit-variance using training data.


4.2 Important region prediction accuracy 

We first evaluate our method’s ability to predict important
regions, compared to three state-of-the-art methods: (1) the
object-like score of [20], (2) the object-like score of [21],
and (3) a bottom-up saliency detector [55]. The first two
are high-level learned functions that predict a region’s likelihood
of overlapping a true object, whereas the third is a
low-level detector to find regions that “stand out”. They are
all general-purpose metrics (not tailored to egocentric data),
so they allow us to gauge the impact of our proposed egocentric
cues for finding important objects in video.

We use the annotations obtained on MTurk as ground
truth (GT) (see Sec. 3.2). Some frames contain more than
one important region, and some contain none, depending
on what the annotators deemed important. On average, each
video contains 680 annotated frames and 280,000 test regions.
A region $r$ is considered to be a true positive (i.e.,
important object), if its overlap score with any GT region is
greater than 0.5, following PASCAL convention.
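The true-positive test can be illustrated with the PASCAL overlap score (intersection area over union area). For brevity this sketch assumes axis-aligned bounding boxes given as (x1, y1, x2, y2); the regions in the paper are segments, for which the same ratio would be computed over pixel masks. Function names are hypothetical.

```python
def box_area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def overlap_score(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(region, gt_regions, threshold=0.5):
    """True if the region overlaps any ground-truth region by more than threshold."""
    return any(overlap_score(region, gt) > threshold for gt in gt_regions)

gt_box = (0.0, 0.0, 10.0, 10.0)
half_overlap = (0.0, 0.0, 10.0, 5.0)   # covers exactly half of gt_box
good_overlap = (0.0, 0.0, 10.0, 6.0)   # covers 60% of gt_box
```

Note that the comparison is strict, so a region with an overlap of exactly 0.5 does not count as a true positive.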

Fig. 6 shows precision-recall curves on all test regions
across all train/test splits. Our approach predicts important
regions significantly better than all three existing methods.
The two high-level methods [20, 21] can successfully find
prominent object-like regions, and so they noticeably outperform
the low-level saliency detector. However, by focusing
on detecting any object, unlike our approach they are
unable to distinguish those that may be important to a camera
wearer.
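For readers unfamiliar with the metric, the sketch below shows one common way to compute a precision-recall curve and average precision from predicted importance scores and binary true-positive labels. The exact AP variant used for Fig. 6 is not stated in this section, so the definition here (mean precision at the rank of each positive) is an assumption.

```python
def precision_recall(scores, labels):
    """Precision and recall after each region, ranked by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_positives = sum(labels)
    true_positives = 0
    precisions, recalls = [], []
    for rank, i in enumerate(order, start=1):
        true_positives += labels[i]
        precisions.append(true_positives / rank)
        recalls.append(true_positives / total_positives if total_positives else 0.0)
    return precisions, recalls

def average_precision(scores, labels):
    """Mean of the precision values at the ranks where a positive is retrieved."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    precisions, _ = precision_recall(scores, labels)
    hits = [precisions[rank] for rank, i in enumerate(order) if labels[i] == 1]
    return sum(hits) / len(hits) if hits else 0.0

toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]  # 1 marks a region that overlaps a GT important region
toy_ap = average_precision(toy_scores, toy_labels)
```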

Fig. 7 shows example important regions detected by
each method. The first four columns show examples of correct
predictions made by our method. We see that low-level
saliency detection [55] is insufficient; its local estimates fail
to find object-like regions. For example, it finds a bright blob
surrounded by a dark region to be the most salient (first row,
fourth column).

 1. size              8. height            15. obj app.       22. bbox x + reg freq.
 2. size + height     9. pt freq.          16. x              23. x + reg freq.
 3. y + face         10. size + reg freq.  17. size + x       24. obj app. + size
 4. size + pt freq.  11. gaze              18. gaze + x       25. y + interaction
 5. bbox y + face    12. face              19. obj app. + y   26. width + height
 6. width            13. y                 20. x + bbox x     27. gaze + bbox x
 7. size + gaze      14. size + width      21. y + bbox x     28. bbox y + interaction

Fig. 8 Top 28 features with highest learned weights.

The last four columns show examples of incorrect predictions
made by our method. The high-level saliency detection
methods [20, 21] produce better predictions for these
examples. In the first example, our method produces an
under-segmentation of the important object and includes regions
surrounding the television due to the combined region
having higher object-like appearance score than the television
alone. In the second example, our method incorrectly
detects the user’s hand to be important, while in the third
and fourth examples, it determines background regions to
be important due to their high frequency.

We next perform ablation studies to investigate the contribution
of the pairwise interaction terms of our importance
predictor. Specifically, we compare to a linear regression
model and an L1-regularized linear regression model using
only the original 14-dimensional features. The average precision
of the linear regression model is 0.20, and the average
precision of the L1-regularized model ranges from 0.14-0.20
depending on the level of sparsity, as enforced by the weight
on the regularization term. This result shows that the original
features alone are not sufficiently expressive, and that the
pairwise terms are necessary to more fully capture the relationship
between the features and desired importance values.
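To illustrate what adding pairwise interaction terms to the 14-dimensional feature vector involves, the sketch below appends one term per unordered feature pair. Modeling each interaction as the product of the two feature values is an assumption of this example; the function name is hypothetical.

```python
from itertools import combinations

def expand_with_pairwise(features):
    """Original features followed by one product term per unordered feature pair."""
    features = list(features)
    pair_terms = [features[i] * features[j]
                  for i, j in combinations(range(len(features)), 2)]
    return features + pair_terms

# A 14-dimensional vector grows to 14 + C(14, 2) = 14 + 91 = 105 dimensions.
toy_region_features = [float(i) for i in range(1, 15)]
toy_expanded = expand_with_pairwise(toy_region_features)
```

A linear model fit on the expanded vector can then weight combinations such as size together with gaze, which a linear model on the original 14 features cannot express.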


4.3 Which cues matter most for importance? 

Fig. 8 shows the top 28 out of 105 (= 14 + $\binom{14}{2}$) features
that receive the highest learned weights (i.e., $\beta$ magnitudes).
Region size is the highest weighted cue, which is reasonable
since an important person/object is likely to appear
roughly at a fixed distance from the camera wearer. Among
the egocentric features, gaze and frequency have the highest
weights. Frontal face overlap is also highly weighted;
intuitively, an important person would likely be facing and
conversing with the camera wearer.

Some highly weighted pair-wise interaction terms are
also quite interesting. The feature measuring a region’s face
overlap and y-position has more impact on importance than
face overlap alone. This suggests that an important person
usually appears at a fixed height relative to the camera
wearer. Similarly, the feature for object-like appearance and
y-position has high weight, suggesting that a camera wearer
often adjusts his ego-frame of reference to view an important
object at a particular height.

Surprisingly, the pairing of the interaction (distance to
hand) and frequency cues receives the lowest weight. A
plausible explanation is that the frequency of a handled object
highly depends on the camera wearer’s activity. For example,
when eating, the camera wearer’s hand will be visible
and the food will appear frequently. On the other hand, when
grocery shopping, the important item s/he grabs from the
shelf will (likely) be seen for only a short time. These conflicting
signals would lead to this pair-wise term having low
weight. Another paired term with low weight is an “object-like”
region that is frequent; this is likely due to unimportant
background objects (e.g., the lamp behind the camera
wearer’s companion). This suggests that higher-order terms
could yield even more informative features.


4.4 Importance-based summarization accuracy 

Next we evaluate our method’s summarization results using 
the importance-based criterion, and in the following section 
we evaluate its budget-based results. 


4.4.1 Quantitative evaluation 


The central premise of our work is that day-to-day activity
viewed from the first person perspective largely revolves
around the important people and objects with which the
camera wearer interacts. Accordingly, a good visual summary
must capture those important entities. Thus, we analyze
the recall rate for our method and two competing summarization
strategies. The first is uniform keyframe sampling,
and the second is event-based adaptive keyframe sampling.
The latter computes events using the same procedure
as our method (Sec. 3.4), and then divides its keyframes
evenly across events. Both methods are modeled after standard
keyframe and event detection methods [56, 12].
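The two baselines can be sketched as follows. How each baseline places a keyframe within its segment is not specified in this section, so taking the midpoint of equal-length segments, and giving any leftover keyframes to the earliest events, are assumptions of this example.

```python
def uniform_keyframes(num_frames, k):
    """Pick the middle frame of each of k equal-length segments of the video."""
    return [int((i + 0.5) * num_frames / k) for i in range(k)]

def event_based_keyframes(events, k):
    """Divide k keyframes as evenly as possible across events.

    events: list of (start, end) frame indices, inclusive, in temporal order.
    """
    base, extra = divmod(k, len(events))
    keyframes = []
    for index, (start, end) in enumerate(events):
        count = base + (1 if index < extra else 0)  # earliest events get leftovers
        length = end - start + 1
        keyframes += [start + int((i + 0.5) * length / count) for i in range(count)]
    return keyframes

toy_uniform = uniform_keyframes(100, 4)
toy_event_based = event_based_keyframes([(0, 9), (10, 99)], 4)
```

With a short first event and a long second event, the event-based variant spends two of its four keyframes on the short event, whereas uniform sampling gives it only one.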

Fig. 9 shows the results. Each set of bars shows the recall
rates for the three methods. Our method varies its selection
criterion on $I(r)$ over {0.2, 0.4}, for two summaries in total
for each user. These thresholds are arbitrary and chosen to
cover a broad spectrum (i.e., low and high selection criteria);
we see consistent relative results for any threshold. To
compare our recall rates to those of the baselines, we create
summaries for the baselines with the same number of frames
as ours.

Fig. 10 Comparison to alternative summarization strategies, in terms
of the prominence of the objects within selected keyframes. Our summaries
more prominently display the important objects.

Fig. 12 An application of our approach that shows the GPS tracks of
the camera wearer, the important people and objects that s/he interacted
with, and their timeline.

If a frame contains multiple important objects, we score
only the main one. Likewise, if a summary contains multiple
instances of the same GT object, it gets credit only once.
Note that this measure is favorable to the baselines, since it
does not consider object prominence in the frame. For example,
we give credit for the TV in the last frame in Fig. 10
bottom row, even though it is only partially captured. Furthermore,
by definition, the uniform and event-based baselines
are likely to get many hits for the most frequent objects.
These make the baselines very strong and meaningful
comparisons.
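The scoring rule described here (one main object per frame, credit at most once per GT object) can be written compactly. The data layout, a mapping from frame index to the label of its main important object, is a hypothetical choice for this sketch.

```python
def important_object_recall(summary_frames, main_object_of, gt_objects):
    """Fraction of ground-truth important objects that the summary shows at least once.

    summary_frames: frame indices selected for the summary
    main_object_of: dict mapping frame index -> label of its main GT object
                    (frames without an important object are absent or map to None)
    gt_objects:     set of all ground-truth important object labels
    """
    found = {main_object_of.get(frame) for frame in summary_frames}
    found &= set(gt_objects)  # drops None and anything that is not a GT object
    return len(found) / len(gt_objects)

toy_gt = {"tv", "dog", "cup", "man"}
toy_main_object = {1: "tv", 2: "tv", 3: "dog", 4: None}
toy_recall = important_object_recall([1, 2, 3, 4, 5], toy_main_object, toy_gt)
```

In the toy case the TV appears in two selected frames but is counted once, so the summary recalls two of the four important objects.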

Overall, our summaries include more important peo¬ 
ple/objects with the same number of frames. For exam¬ 
ple, for User 2 with selection criterion on I(r) > 0.2, 
our method finds 62% of important objects in 27 frames, 
whereas the uniform keyframe and event-based adaptive 
keyframe sampling methods find 54% and 46% of impor¬ 
tant objects, respectively. The lower absolute recall rate for 
all methods for User 4 is due to many small GT objects that 
appear together in the same frame (the user was cooking 
and baking). On average, we find 9.13 events/video and 2.05 
people/objects per event. 

While Fig. 9 captures the recall rate of the important
objects, it does not measure the prominence of the objects
in the selected frames. An informative summary should include
not just any instance of the important object, but
frames in which it is displayed prominently (i.e., large and
centered). To this end, in Fig. 10 we quantify the prominence
of important objects in each method’s summaries, in
terms of the distance of the region’s centroid to the frame
center. We see our method better isolates the prominent instances,
thanks to its egocentric cues. For example, in the top
right example, the TV has high prominence in our summary
and low prominence in the uniform keyframe sampling’s
summary.
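A minimal sketch of the prominence measure follows: the distance from a region’s centroid to the frame center. Dividing by the half-diagonal so that the value lies between 0 (centered) and 1 (at a corner) is an assumption added here for readability; the function name is hypothetical.

```python
import math

def centroid_to_center_distance(centroid, frame_size):
    """Normalised distance from a region centroid (x, y) to the frame center.

    frame_size is (width, height). Returns 0.0 at the center and 1.0 at a corner.
    """
    center_x, center_y = frame_size[0] / 2.0, frame_size[1] / 2.0
    distance = math.hypot(centroid[0] - center_x, centroid[1] - center_y)
    return distance / math.hypot(center_x, center_y)

centered = centroid_to_center_distance((160, 120), (320, 240))
in_corner = centroid_to_center_distance((0, 0), (320, 240))
```

Smaller values indicate a more prominently framed object.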


4.4.2 Summarization examples


Fig. 11 shows example summaries from our method and the
keyframe sampling baseline. The colored blocks on ours indicate
the discovered events. We see that our summary not
only has better recall of important objects, but it also selects
views in which they are prominent in the frame. This helps
more clearly reveal the story of the video. For instance, for
the top example, the story is: selecting an item at the supermarket
→ driving home → cooking → eating and watching
TV. We provide additional summaries at the project webpage.

Fig. 11 (bottom) also depicts our method’s failure
modes. Redundant frames of the same object can appear
due to errors in event segmentation (see the man captured
in Events 2 and 3) or the candidate important object clustering
(the sink is captured twice in Event 10). Adding features
like GPS or depth might reduce such errors.

Fig. 12 shows another example where we track the camera
wearer’s location with a GPS receiver, and display our
method’s keyframes on a map with the tracks (purple trajectory)
and timeline. This result suggests a novel multi-media
application of our visual summarization algorithm that incorporates
location, temporal, and visual data.

In all the results in this section, the two baselines
perform fairly similarly to one another; compared to our
method, they are prone to choosing unimportant or redundant
frames that lack focus on those objects a human viewer
has deemed important. This supports our main hypothesis
that the traditional low-level cues used in generic video summarization
methods are insufficient to select keyframes that
capture key objects in egocentric video. Building on this
finding, the user studies below analyze the impact that including
important objects has on perceived summary quality.


4.5 Budgeted frame selection accuracy


We next evaluate our approach for the scenario where we
must handle requests such as, “I would like to see a 10-frame
summary of the original video”.

We compare our budgeted $k$-frame selection approach
to four alternative methods: (1) the state-of-the-art video
summarization method of [13], which selects keyframes that
provide maximal content inclusion. Briefly, it iteratively selects
the frame that is on average most similar to all remaining
frames without being redundant to the frames that have
already been chosen.^2 (2) the keyframe selection approach
of [4], which optimizes an energy function that enforces adjacent
frames to be maximally different. For fairest comparison,
we use the same $\chi^2$-distance on color histograms
used by our method to gauge visual dissimilarity. (3) a side-by-side
implementation of our approach without event segmentation
and region grouping (i.e., it selects $k$ frames from
all frames of the video), and (4) uniform keyframe sampling.
The former two contrast our method with existing
techniques that target the generic video summarization problem,
highlighting the need to specialize to egocentric data
as we propose. The latter two isolate the impact of our importance
predictions as well as our event segmentation and
region grouping.

2 This method summarizes a collection of videos, so we treat each
event in our data as a different video.

Fig. 9 Comparison to alternative summarization strategies, in terms of
important object recall rate. Using the same number of frames, our approach
includes more important people and objects.

Fig. 11 (a) Our summary versus (b) uniform keyframe sampling. The colored
blocks for ours indicate the discovered events. Our summary focuses
on the important people and objects. While uniform keyframe sampling does
hint at the course of events, it tends to include irrelevant or redundant
frames (e.g., repeated instances of the man in the bottom example) because
it lacks a notion of object importance.

4.5.1 Quantitative evaluation 

The plots in Fig. 13 show the results. We plot % of important
objects found as a function of # of frames in the summary, in
order to analyze both the recall rate of the important objects
as well as the compactness of the summaries. Each point on
the curve shows the result for a different summary of the
required length. We score the objects found in the same way
as in Section 4.4.1.

Our model significantly outperforms the keyframe
method [4], which confirms that modeling the importance of
the object or person is critical to produce informative summaries
for egocentric videos. In fact, the existing method
performs even worse than uniform sampling, due to its preference
for frames that are maximally dissimilar to their surrounding
selected frames. As a result, it tends to select redundant
frames containing the same visual elements in an
alternating fashion. Our summary does not have this issue
since we represent each object in each event with a single
region/frame through region clustering.

Our model also outperforms the multi-document
method [13] on all but one user. While this prior method
successfully selects diverse content throughout the video,
its reliance on low-level image cues leads to choosing some
non-essential frames.

With very short summaries, uniform sampling performs 
similarly to ours; the selected keyframes are more spread 
out in time and have a high chance of including unique peo¬ 
ple/objects. However, with longer summaries, our method 
always outperforms uniform sampling, since uniform sam¬ 
pling ignores object importance and tends to include frames 
repeating the same important object. 

Our model also outperforms the baseline that selects k- 
frames from the entire video without event segmentation and 
region grouping (“No events”). Since this method does not 
group instances of the same object together, it can select the 
same important object multiple times. 

4.5.2 Summarization examples 

Fig. 14 shows example summaries created by each method.
By focusing on the important people and objects, our
method produces the best results.


4.6 User studies to evaluate summaries 

We next perform user studies, since ultimately the impact 
of a summary depends on its value to a human viewer. As 
subjects, we recruit both the camera wearers as well as 25 
subjects uninvolved with the data collection or research in 
any way. The camera wearers are a valuable resource to dis¬ 
cern summary quality, since they alone fully experienced 
the original content. Complementary to that, the uninvolved 
subjects are valuable to objectively gauge whether the over¬ 
all events are understandable—without the implicit benefit 
of being able to “fill in the gaps” with their own firsthand 
experience of the events being summarized. 



                  Much better   Better    Similar   Worse    Much worse
Imp. captured     31.25%        37.5%     18.75%    12.5%    0%
Overall quality   25%           43.75%    18.75%    12.5%    0%

Table 1 Camera wearer user study results.


4.6.1 Evaluation by the camera wearers 

To quantify perceived quality, we ask the camera wearers 
to compare our method’s summaries to those generated by 
uniform keyframe sampling. The camera wearers are good 
judges, since they know the full extent of their day that we 
are attempting to summarize. 

We generate four pairs of summaries for each user, each 
of different length. We ask the subjects to view our sum¬ 
mary and the baseline’s (in some random order unknown 
to the subject, and different for each pair), and answer two 
questions: (1) Which summary captures the important peo¬ 
ple/objects of your day better? and (2) Which provides a 
better overall summary? The first specifically isolates how 
well each method finds important, prominent objects, and 
the second addresses the overall quality and story of the 
summary. 

Table 1 shows the results, in terms of how often our summary
is preferred. In short, out of 16 total comparisons, our
summaries were found to be better 68.75% of the time. We
find our approach can fail to produce better summaries than
uniform keyframe sampling if the user’s day is very simple.
Specifically, User 3 was working on her laptop the entire
day; first at home, then at class, then during lunch, and finally
at the library. For this video, uniform keyframe sampling
was sufficient to produce a good summary.

4.6.2 Evaluation by independent subjects 

Next, to measure the quality of our summary on an abso¬ 
lute scale and to allow independent judges to evaluate a vi¬ 
sual summary’s informativeness, we ask each camera wearer 
to provide a “ground-truth” text summary of his/her day. 
Specifically, we ask the users to provide full sentence de¬ 
scriptions that emphasize the key happenings (i.e., who s/he 
met, what s/he did, where, and when), and in sequential or¬ 
der as they happened that day. The resulting text summaries 
are 6-10 sentences long. Here is an example from User 2: 

“My boyfriend and I drove to a farmers market in the
early afternoon, where we sampled some food. Then (also
in the early afternoon) we drove to a pizza place, where
we stayed for a while, talked, had pizza, drank beer, and
watched TV. After that, in the afternoon, we walked to a
frozen yogurt place and split a cup of frozen yogurt, with
brief looks at an animation that was playing. Then we
walked around for a while, and drove home in the early
evening. At home, we played with Legos for a while, in the
living room. Then we watched some videos on YouTube. After
that we played with Legos some more, and I washed some
dishes in the kitchen, in the evening.” See the supplementary
file for the remaining text summaries.

Fig. 13 Comparison to alternative $k$-frame summarization strategies. Our
budgeted frame selection approach produces more informative summaries
with fewer frames. (Legend entry recovered from the plot: Length budget
(Ours); horizontal axis: # of frames in summary.)

                      Much better   Better    Similar   Worse     Much worse
keyframes [4]         16.43%        45.45%    13.99%    18.88%    5.25%
multi-document [13]   21.08%        36.14%    17.47%    17.47%    7.84%
uniform sampling      10.22%        37.63%    14.52%    29.03%    8.60%

Table 2 Mechanical Turk user study results on UT Ego.

                      Much better   Better    Similar   Worse     Much worse
keyframes [4]         26.80%        41.24%    17.01%    11.34%    3.61%
multi-document [13]   13.90%        28.88%    25.67%    26.20%    5.35%
uniform sampling      11.95%        33.96%    20.75%    25.79%    7.55%

Table 3 Mechanical Turk user study results on ADL.

We then ask 25 subjects using Mechanical Turk to compare
our summary and the baselines’ (without knowing
which method generated the summary) to the text summary
provided by the camera wearer of the corresponding video,
and answer: How well does the visual summary follow the
story of the text summary? On a scale of 1 to 5 (1 being “very
well” and 5 being “very poorly”, so lower is better), over all
16 summaries, ours scored 2.61 (±0.97). The prior methods
[4], [13], and uniform sampling scored 3.43 (±1.05), 3.28 (±1.10),
and 2.94 (±1.09), respectively. In general, the judges found the
longer summaries to better align with the corresponding text
summary than the shorter summaries. On some videos, our
shorter summaries failed to capture all of the details in the
text summary, resulting in poor scores.

While the result above gauges quality on an absolute
scale, we also ran a comparative test. Here, we ask the subjects
to compare our summary and each baseline’s (in random
order) to the text summary, and answer: Which visual
summary more closely follows the story of the text summary?
Table 2 shows the accumulated responses from all 25 subjects.
Out of 16 total comparisons to each baseline, our summaries
were found to be better 48-62% of the time, and only
worse 24-38% of the time.
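As a quick arithmetic check, the "better" and "worse" ranges quoted above follow from summing the two favorable and the two unfavorable columns of each row of Table 2. The dictionary layout and function names below are choices made for this sketch; the numbers are copied from the table.

```python
# Rows of Table 2 in percent: (much better, better, similar, worse, much worse).
TABLE_2 = {
    "keyframes [4]": (16.43, 45.45, 13.99, 18.88, 5.25),
    "multi-document [13]": (21.08, 36.14, 17.47, 17.47, 7.84),
    "uniform sampling": (10.22, 37.63, 14.52, 29.03, 8.60),
}

def better_and_worse(row):
    """Collapse a five-way response row into (better, worse) percentages."""
    much_better, better, _similar, worse, much_worse = row
    return much_better + better, worse + much_worse

def summary_ranges(table):
    """Return ((min better, max better), (min worse, max worse)) over all baselines."""
    pairs = [better_and_worse(row) for row in table.values()]
    betters = [b for b, _ in pairs]
    worses = [w for _, w in pairs]
    return (min(betters), max(betters)), (min(worses), max(worses))

better_range, worse_range = summary_ranges(TABLE_2)
```

The extremes come from uniform sampling (about 48% better, 38% worse) and keyframes [4] (about 62% better, 24% worse).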


4.7 Experiments on ADL 

Finally, we perform experiments on ADL, an interesting and
complementary dataset to UT Ego that contains egocentric
videos of people performing daily activities in their home
(e.g., washing dishes, brushing teeth, etc.). It contains 20
videos, each roughly 30 minutes in length.

Since this data lacks ground-truth important object an¬ 
notations, we use it only to evaluate our summaries. We take 
the importance predictor from UT Ego (trained on all four 
videos), and use it to predict region importance on the ADL 
videos. We use our budgeted frame selection approach and 
set the summary frame-length to k = 8 (an arbitrary but 
reasonable number given the short length of ADL videos). 
For each video, we ask an independent subject to watch the 
video and provide a text summary that emphasizes the key 
happenings, in the same manner as described in Sec |4.6.2 


The resulting summaries tend to focus on specific actions 
and are more descriptive than those provided by the camera 
wearers on UT Ego. We suspect this is due to the relatively 
short length of each video (~30 minutes). Here is an example 
summary: 

“A guy brought his laundry basket to the laundry room 
to do laundry. He poured in the liquid detergent and did his 
laundry. He then went back home and started to play a video 
game on TV. The guy went into his room and turned on his 
laptop computer and looked at a picture of a monkey. The 
guy went into the bathroom to wash his face and brush his 
teeth. The guy is now in his kitchen and poured some juice 
to drink. He's looking at a list and checking off his list. The 
guy is making tea. The guy went into the bathroom to comb 
his hair. The guy cleaned his kitchen floor with a broom. 
He then went into his bedroom and put on his shoes.” See 
supplementary file for all text summaries. 

We then ask 10 Mechanical Turk subjects per video 
to compare our summaries to those of uniform sampling, 
keyframes [4], and multi-document [13], and ask the same 
set of questions as in Sec. 4.6.2. Table 3 shows the results. 
Out of 20 total pairwise comparisons to each baseline, 
our summaries were found to be better 42-68% of 
the time, and worse 15-33% of the time. In terms of how 
each method’s summary compares to the text summary, 
ours, [4], [13], and uniform sampling scored 2.71 (±1.02), 
3.58 (±0.97), 2.99 (±1.15), and 2.89 (±1.06), respectively 
(recall that lower numbers are better: 1 being “very well” 
and 5 being “very poorly”). We show clear improvement 
over keyframes [4], which tends to simply oscillate between 
bright/dark frames. Our improvements over uniform sampling 
and multi-document [13] are smaller than those 
on UT Ego. This is likely due to the ADL videos being 
shorter in length and more structured; in ADL, the camera 
wearers are given a list of actions they should perform, 
whereas UT Ego is completely unscripted. Under these conditions, 
summarization algorithms that aim to select frames 
that are spread out over time are likely to select meaningful 
frames. Still, by focusing on the important objects, our approach 
produces the best summaries. Fig. 15 shows example 
summaries created by each method. Our method selects the 
most informative frames. 

[Figure 14: storyboard images omitted; frames are labeled Event 1 through Event 10. Surviving panel titles: Length budget (Ours); No events; Keyframes (Liu and Kender 2002).]

Fig. 14 Example summaries per method on UT Ego. The “No events” baseline can include redundant frames because it lacks event segmentation 
and region clustering to group instances of the same object together (e.g., the yellow notepad and man). Keyframe selection [4] focuses on selecting 
adjacent frames that are maximally dissimilar, leading it to toggle between highly diverse frames, which need not capture important objects. While 
the multi-document summarization objective [13] overcomes this toggling effect, both it and uniform keyframe sampling tend to select redundant 
frames (e.g., see repeated instances of the man). Overall, our summary best focuses on the important people and objects. It selects informative 
frames that convey the chain of events through the objects and people that drive the first person interactions. 

[Figure 15: storyboard images omitted. Surviving panel titles: Length budget (Ours); Multi-document (Weng and Merialdo 2009); Uniform keyframe sampling.]

Fig. 15 Example summaries per method on ADL. Keyframe selection [4] focuses on selecting adjacent frames that are maximally dissimilar, leading 
it to toggle between highly diverse frames, which need not capture important objects. While the multi-document summarization objective [13] 
overcomes this toggling effect, both it and uniform keyframe sampling can select irrelevant frames (e.g., see 2nd and 7th columns). Overall, our 
summary selects informative frames that best focus on the important objects that drive the first person interactions. 

Overall, the results are a promising indication that discovering 
important people and objects leads to higher quality 
summaries for egocentric video. Not only do we better 
recount those objects that human viewers deem important in 
the context of the surrounding activity, but we also generate 
summaries that human viewers prefer to multiple existing 
summarization approaches. 

5 Conclusion and Future Work 

We introduced an approach to summarize egocentric video 
using novel egocentric cues to predict important regions. We 
presented two ways to adjust summary compactness: given 
either an importance selection criterion or a length budget. 
For the latter, we developed an efficient optimization strategy 
to recover the best k-frame summary. To our knowledge, 
ours is the first work to summarize videos from wearable 
cameras by discovering objects that may be important to 
the camera wearer. Existing summarization techniques rely 
on static cameras or low-level visual similarity, and so they 
fail to account for the key objects that drive first person interactions. 
Through extensive experiments, we showed that 
our approach produces significantly more informative summaries 
than prior methods. 

Future work can expand this idea in several interesting 
directions. We assumed that the importance cues can be 
learned and shared across users, and our experiments confirmed 
that this is feasible. However, there are also subjective 
elements; e.g., depending on the user, a person with whom 
the wearer has significant interactions may or may not be 
considered important. To overcome the subjectivity, one could learn a 
wearer-specific model that uses input from the wearer for 
training to complement our wearer-independent model. 

Secondly, event segmentation remains a challenge for 
egocentric data. With the frequent head and body motion 
inherent to wearable video, grouping frames according to 
low-level scene statistics is imperfect. In our system, this 
can sometimes lead to redundant keyframes showing the 
same object. One way to mitigate this issue is to use a GPS 
receiver and generate event clusters using both location information 
and scene appearance. This could provide better 
separation of events, especially when the scene appearance 
between two neighboring events is similar. More broadly, 
more robust detection of event boundaries is needed. 
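
One way to picture the GPS suggestion is the sketch below, which cuts an event boundary wherever a weighted mix of appearance change and location change between consecutive frames is large. Everything here is a hypothetical illustration of the proposed future direction, not a component of our system: the function name, the mixing weight `alpha`, the threshold, the per-sequence max normalization, and the assumption that positions are already expressed in a local planar frame (e.g., meters) are all choices made for the example.

```python
import numpy as np


def segment_events(appearance, positions, alpha=0.5, threshold=0.5):
    """Assign an event id to each frame by cutting at large combined jumps."""
    appearance = np.asarray(appearance, dtype=float)  # shape (n, d)
    positions = np.asarray(positions, dtype=float)    # shape (n, 2)
    if len(appearance) != len(positions):
        raise ValueError("appearance and positions must have the same length")
    n = len(appearance)
    if n == 0:
        return []
    if n == 1:
        return [0]

    # Change between each pair of consecutive frames.
    d_app = np.linalg.norm(np.diff(appearance, axis=0), axis=1)
    d_loc = np.linalg.norm(np.diff(positions, axis=0), axis=1)

    def unit(d):
        # Scale to [0, 1] by the largest jump in this sequence (crude choice).
        peak = d.max()
        return d / peak if peak > 0 else d

    combined = alpha * unit(d_app) + (1.0 - alpha) * unit(d_loc)
    boundaries = combined > threshold
    labels = np.concatenate([[0], np.cumsum(boundaries)])
    return labels.astype(int).tolist()


# Toy case: appearance drifts slowly, but the wearer changes location once.
toy_appearance = [[0.0], [0.1], [0.1], [0.2], [0.2], [0.3]]
toy_positions = [[0, 0], [0, 0], [0, 0], [10, 10], [10, 10], [10, 10]]
toy_labels = segment_events(toy_appearance, toy_positions, threshold=0.6)
print(toy_labels)
```

In the toy case the appearance changes are equally small everywhere, so only the frame pair that also involves a change in location exceeds the threshold.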

Finally, while our interest lies in the computer vision 
challenges, other sensing modalities naturally can play a 
role in egocentric summarization. For example, audio cues 
could signal person importance based on their speech near 
the camera, while ambient noise may be indicative of the 
scene type. Other sensors like an accelerometer can reveal 
the user’s gestures and activity, while GPS coordinates could 
give real-world location context relevant to which objects 
are likely important (e.g., a plate in a restaurant, vs. an athlete 
in a stadium). 